Augmenting Pivot based SMT with word segmentation
Abstract
This paper attempts to address two well-known causes of degraded performance in SMT: (i) differences in the morphological characteristics of the two languages, and (ii) scarcity of parallel corpora. We address these two problems using “word segmentation” and “pivots” on the morphologically complex language. Our case study is Malayalam-to-Hindi SMT. Malayalam belongs to the Dravidian family of languages and is heavily agglutinative. Hindi is a representative of the Indo-Aryan language family and is morphologically simpler. We use triangulation as the pivoting strategy, in combination with morphological pre-processing. We observe that (i) translation quality improves significantly over direct SMT when a pivot is used in combination with direct SMT, (ii) performance improves as the number of pivots increases, and (iii) word segmentation is essential. We achieved an improvement of 9.4 BLEU points, which is over 58% relative to the baseline direct system. Our work paves the way for SMT of languages that face resource scarcity and have widely divergent morphological characteristics.
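As a rough illustration of the triangulation idea named in the abstract, the sketch below combines a source-to-pivot and a pivot-to-target phrase table by marginalizing over pivot phrases, p(t|s) = Σ_p p(p|s) · p(t|p), and then mixes the result with a direct table. The table layout, the function names `triangulate` and `interpolate`, the linear interpolation and its `weight` parameter, and the placeholder tokens are all assumptions made for illustration; the abstract does not specify how the authors combine the pivot and direct models.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Build a source->target table from source->pivot and pivot->target tables.

    Each table maps a phrase to {translation: probability}.
    Computes p(t|s) = sum over pivot phrases p of p(p|s) * p(t|p).
    """
    src_tgt = defaultdict(lambda: defaultdict(float))
    for s, pivots in src_pivot.items():
        for p, prob_p_given_s in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for t, prob_t_given_p in pivot_tgt.get(p, {}).items():
                src_tgt[s][t] += prob_p_given_s * prob_t_given_p
    return {s: dict(ts) for s, ts in src_tgt.items()}


def interpolate(direct, triangulated, weight=0.5):
    """Linearly mix a direct table with a triangulated one (assumed scheme)."""
    mixed = defaultdict(lambda: defaultdict(float))
    for table, w in ((direct, weight), (triangulated, 1.0 - weight)):
        for s, targets in table.items():
            for t, prob in targets.items():
                mixed[s][t] += w * prob
    return {s: dict(ts) for s, ts in mixed.items()}


# Toy tables with placeholder tokens (not real Malayalam or Hindi data).
src_pivot = {"s1": {"p1": 0.5, "p2": 0.5}}
pivot_tgt = {"p1": {"t1": 1.0}, "p2": {"t1": 0.5, "t2": 0.5}}
direct = {"s1": {"t1": 1.0}}

pivot_table = triangulate(src_pivot, pivot_tgt)
combined = interpolate(direct, pivot_table, weight=0.5)
```

With more than one pivot language, one triangulated table per pivot could be built the same way and added to the mixture, which is one plausible reading of the reported gain from additional pivots.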
Similar resources
The TCH machine translation system for IWSLT 2008
This paper reports on the first participation of TCH (Toshiba (China) Research and Development Center) at the IWSLT evaluation campaign. We participated in all the 5 translation tasks with Chinese as source language or target language. For Chinese-English and English-Chinese translation, we used hybrid systems that combine rule-based machine translation (RBMT) method and statistical machine tra...
Dialect Translation: Integrating Bayesian Co-segmentation Models with Pivot-based SMT
Recent research on multilingual statistical machine translation (SMT) focuses on the usage of pivot languages in order to overcome resource limitations for certain language pairs. This paper proposes a new method to translate a dialect language into a foreign language by integrating transliteration approaches based on Bayesian co-segmentation (BCS) models with pivot-based SMT approaches. The ad...
Unsupervised Morphological Segmentation for Statistical Machine Translation
Statistical Machine Translation (SMT) techniques often assume the word is the basic unit of analysis. These techniques work well when producing output in languages like English, which has simple morphology and hence few word forms, but tend to perform poorly on languages like Finnish with very complex morphological systems with a large vocabulary. This thesis examines various methods of augment...
Utilizing Lexical Similarity between Related, Low-resource Languages for Pivot-based SMT
We investigate pivot-based translation between related languages in a low resource, phrase-based SMT setting. We show that a subword-level pivot-based SMT model using a related pivot language is substantially better than word and morpheme-level pivot models. It is also highly competitive with the best direct translation model, which is encouraging as no direct source-target training corpus is us...
Statistical Machine Translation without a Source-side Parallel Corpus Using Word Lattice and Phrase Extension
Statistical machine translation (SMT) requires a parallel corpus between the source and target languages. Although a pivot-translation approach can be applied to a language pair that does not have a parallel corpus directly between them, it requires both source–pivot and pivot–target parallel corpora. We propose a novel approach to apply SMT to a resource-limited source language that has no par...
Publication date: 2015